Aarthi Ravindran, 20/11/2019 - RStudio Exercise 4: Tasks and Instructions. Note: for the Exercise 5 data wrangling, please check the link below.
Analysis Exercises - 15 marks
#load Boston data from Mass package (Exercise 4.2)
#The following R packages are installed already, which are mandatory to run this analysis
#install.packages("MASS")
#install.packages("corrr")
library(MASS)
library(tidyr)
library(corrr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#install.packages("corrplot")
library(corrplot)
## corrplot 0.84 loaded
#if(!require(devtools)) install.packages("devtools")
#devtools::install_github("ggobi/ggally")
#install.packages("GGally")
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
data(Boston)
#Explore the Boston data (Exercise 4.2)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
#Graphical Overview of data (Exercise 4.3)
pairs(Boston)
#Calculate the correlation matrix of the variables and round it to 2 decimals
cor_matrix <- cor(Boston) %>% round(digits = 2)
#print correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
#visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
#Another package for the correlation matrix
#visualize the correlations with ggcorr from GGally
cor_matrix_gg <- ggcorr(Boston, palette = "RdBu", label = TRUE)
cor_matrix_gg
*Description of Boston Data:*
The Boston dataset describes housing values in the suburbs of Boston. It has 506 rows and 14 columns (from dim()); the columns are explained below:
crim - per capita crime rate by town, zn - proportion of residential land zoned for lots over 25,000 sq.ft., indus - proportion of non-retail business acres per town, chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise), nox - nitrogen oxides concentration (parts per 10 million), rm - average number of rooms per dwelling, age - proportion of owner-occupied units built prior to 1940, dis - weighted mean of distances to five Boston employment centres, rad - index of accessibility to radial highways, tax - full-value property-tax rate per $10,000, ptratio - pupil-teacher ratio by town, black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town, lstat - lower status of the population (percent), medv - median value of owner-occupied homes in $1000s.
Boston Data Variables Interpretation from Data Exploration and Graphical Plots:
From str() we can see that all the variables in the data are numeric or integer.
From summary() we get the minimum, maximum, median, mean, and 1st and 3rd quartiles of each variable. For example, the crime rate ranges from a minimum of 0.006 to a maximum of 88.98; the proportion of non-retail business acres per town reaches a maximum of 27.74; the average number of rooms per dwelling ranges from about 3.6 to 8.8; the proportion of owner-occupied units built prior to 1940 ranges from 2.9 to 100; and so on.
The relationship between two variables can be described by the correlation between them. Briefly, when two sets of data are strongly linked together, we say they have a high correlation. The word correlation is made of co- (meaning "together") and relation. Correlation is positive when the values increase together, and negative when one value decreases as the other increases.
The graphical plots pairs(), corrplot() and ggcorr(), which are based on the correlation matrix, show that some of the variables are correlated either positively or negatively.
For example, from the correlation plots we can clearly see that:
crim is positively correlated with rad and tax,
zn is negatively correlated with indus, nox and age, and positively correlated with dis,
nox is positively correlated with age, rad, tax and lstat, and negatively correlated with dis and medv, and so on.
The corrplot is easier to read than the pairs plot, and ggcorr also displays the correlation values.
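As a minimal sketch of how the sign of a correlation behaves (toy vectors, not the Boston data):

```r
# toy vectors: y_up rises with x, y_down falls as x rises
x <- c(1, 2, 3, 4, 5)
y_up <- c(2, 4, 6, 8, 10)
y_down <- c(10, 8, 6, 4, 2)

cor(x, y_up)    # 1: perfect positive correlation
cor(x, y_down)  # -1: perfect negative correlation
```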
########################################################################################################################
#standardize the dataset
boston_scaled <- scale(Boston) #scaling centers each variable to mean 0 and unit variance
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
#Create the categorical variable of crime rate from scaled data
#create the quantile vector
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
#create the categorical variable of crime
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# boston_scaled is available
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
#Check the structure of the modified scaled data
str(boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
dim(boston_scaled)
## [1] 506 14
Scaling Results: Scaling the data is important for analysis: it centers each variable to mean 0 and unit variance, so that variables measured on very different scales can be compared and fed into a model without one of them dominating and biasing the analysis. (Note that scaling does not remove outliers; it only rescales them.)
As a result of scaling, the maximum, minimum, 1st and 3rd quartiles, mean and median of the variables all change.
For example, the maximum of the crime rate decreases from 88.98 to 9.92, and every mean becomes 0.
All the variables in this data are numeric or integer. Next we turn the crime rate into a categorical variable and split the data into train and test sets to check how well the model works. Here 80% of the data goes to the train set, by sampling 80% of the rows.
After converting the crime rate into a categorical variable, we can clearly see the data in four groups: 127 rows with a low crime rate, 126 rows med_low, 126 rows med_high and 127 rows high. The data is divided almost equally by crime rate.
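The quantile-based binning used above can be sketched on a small toy vector (the values are illustrative, not the scaled crim column):

```r
# bin a toy vector into four equal-sized groups by its quartiles
v <- 1:12
breaks <- quantile(v)  # 0%, 25%, 50%, 75%, 100%
groups <- cut(v, breaks = breaks, include.lowest = TRUE,
              labels = c("low", "med_low", "med_high", "high"))
table(groups)  # 3 values in each of the four groups
```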
#############################################################################################################################
Discussion: Before going into Exercise 4.5, we have to check whether the data contains a categorical variable. From str() we can clearly see that the variable "crime" is a factor with four levels (low, med_low, med_high, high), which is required as the grouping variable for Linear Discriminant Analysis.
Linear Discriminant Analysis:
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique. As the name implies, dimensionality reduction techniques reduce the number of dimensions (i.e. variables) in a dataset while retaining as much information as possible. Discriminant analysis is used when the groups are known a priori (unlike in cluster analysis). Each case must have scores on one or more quantitative predictor measures and a score on a group measure.
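As a hedged sketch of the same idea on R's built-in iris data (Species is the known grouping; this example is not part of the exercise itself):

```r
library(MASS)

# fit LDA with Species as the a-priori groups and the four
# flower measurements as quantitative predictors
iris_lda <- lda(Species ~ ., data = iris)

# with 3 groups, LDA produces at most 3 - 1 = 2 discriminants
dim(iris_lda$scaling)  # 4 predictors x 2 linear discriminants
```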
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train) #this train data set has 80% of data
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2648515 0.2450495 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 1.0073962 -0.9337227 -0.15302300 -0.8947010 0.5029708
## med_low -0.1045517 -0.3077623 -0.08835242 -0.5762983 -0.1247922
## med_high -0.4092818 0.2084425 0.16512651 0.4071857 0.0928016
## high -0.4872402 1.0171737 -0.07348562 1.0776070 -0.3978459
## age dis rad tax ptratio
## low -0.8987626 0.8958554 -0.6918543 -0.7352922 -0.4866235
## med_low -0.3897883 0.3762228 -0.5450244 -0.4789081 -0.0990381
## med_high 0.3742674 -0.4107759 -0.3751558 -0.2756626 -0.3037276
## high 0.8278444 -0.8465276 1.6375616 1.5136504 0.7801170
## black lstat medv
## low 0.37635193 -0.7882330670 0.56556249
## med_low 0.31961227 -0.1746746341 0.01137063
## med_high 0.07944298 0.0003490118 0.16930249
## high -0.90378469 0.9335209741 -0.73941569
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.090681280 0.833259477 -0.97698155
## indus 0.014712697 -0.290280553 0.23564356
## chas -0.080261083 -0.046409980 0.06512809
## nox 0.465231782 -0.660865066 -1.32983947
## rm -0.084324937 -0.002446593 -0.17808830
## age 0.215885740 -0.174176322 -0.14953275
## dis -0.055746042 -0.217740969 0.27337394
## rad 2.919728553 0.904097882 0.00481450
## tax -0.004027639 -0.009379552 0.59600517
## ptratio 0.123276293 0.088295584 -0.29127505
## black -0.149389178 -0.018300174 0.09849292
## lstat 0.273535897 -0.231281209 0.30624460
## medv 0.203753553 -0.405005684 -0.21496217
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9416 0.0429 0.0155
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
LD Analysis Discussion: The reduction plot clearly shows the data separated into clusters by crime category. Some of the med_high observations cluster together with the high group, so they might be better placed in the high category than under the manual quartile-based split. The biplot arrows also show how the other variables relate to the crime rate; for example, rad points strongly toward the high-crime cluster, indicating it is highly associated with a high crime rate.
#############################################################################################################################
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 14 1 0
## med_low 2 15 2 0
## med_high 1 12 14 0
## high 0 0 0 28
#compare the class counts in the full data with those in the test and train sets
#class counts in the full (scaled) data
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
#class counts in the test set
table(correct_classes)
## correct_classes
## low med_low med_high high
## 28 19 27 28
#class counts in the training set, from the fitted model
lda.fit$counts
## low med_low med_high high
## 99 107 99 99
Discussion of the predicted model: Manually we divided the observations by numeric range, whereas the LDA predicts the class that best fits each observation given the other variables in the dataset. For example, of the 127 observations originally in the low group, 99 ended up in the training set and 28 in the test set; of those 28 low test cases, the model predicted 13 correctly, and so on for each class.
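One way to summarize a cross-tabulation like the one above in a single number is the overall accuracy, i.e. the share of observations on the diagonal. A sketch on a hypothetical 2x2 confusion matrix (the real counts vary with the random train/test split):

```r
# hypothetical confusion matrix: rows = correct class, cols = predicted
tab <- matrix(c(13,  2,
                 1, 15), nrow = 2, byrow = TRUE)
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # (13 + 15) / 31
```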
#################################################################################################################################
#1. load the Boston data
library(MASS)
data('Boston')
#2. Scale the Boston Data and standardize the dataset
scaled_Boston <- scale(Boston)
#3. calculate the distances between the observations
# euclidean distance matrix
dist_eu <- dist(scaled_Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix (note: computed on the unscaled data here;
# use dist(scaled_Boston, method = 'manhattan') for the scaled version)
dist_man <- dist(Boston, method = 'manhattan')
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.016 149.145 279.505 342.899 509.707 1198.265
#for comparison, the distances without scaling are:
# euclidean:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.119 85.624 170.539 226.315 371.950 626.047
# manhattan:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 2.016 149.145 279.505 342.899 509.707 1198.265
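The difference between the two metrics can be verified by hand on two toy points (purely illustrative, not Boston observations):

```r
# two points in 2-d: Euclidean is the straight-line distance,
# Manhattan sums the absolute coordinate differences
p <- rbind(c(0, 0), c(3, 4))
dist(p)                        # sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")  # 3 + 4 = 7
```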
#4. Run K-means cluster
# k-means clustering
km <-kmeans(scaled_Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(scaled_Boston, col = km$cluster)
Discussion:
In the k-means clustering above, the data is clustered based on the distances between observations. The pairs plot shows the distribution of the clusters across each pair of variables. Here I have used four centers, shown as four different colors in the plot. This clustering differs from the LDA, which was supervised by the categorical crime variable, whereas k-means is unsupervised clustering based on distance.
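Here the number of centers (4) was chosen up front. A common check, not shown in the exercise output above, is to compute the total within-cluster sum of squares (WCSS) for a range of k and look for an "elbow" in the curve. This is a sketch, assuming the scaled_Boston object from above:

```r
library(MASS)
set.seed(123)  # k-means starting centers are random, so fix the seed
scaled_Boston <- scale(Boston)

# total within-cluster sum of squares for k = 1..10
twcss <- sapply(1:10, function(k) kmeans(scaled_Boston, centers = k)$tot.withinss)

# WCSS shrinks as k grows; the elbow suggests a reasonable k
plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
```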
Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)